Abstract. This paper studies the problem of instruction assignment and scheduling 
on spatial architectures. Spatial architectures are architectures whose resources are or- 
ganized in clusters, with non-zero communication delays between the clusters. On these 
architectures, instruction scheduling includes both space scheduling, where instructions 
are mapped to clusters, and the traditional time scheduling. This paper considers the 
problem from both the theoretical and practical perspectives. It presents two integer 
linear program formulations with known performance bounds. We also present an 8- 
approximation algorithm for constant m and constant communication delays. Then, 
we introduce three heuristic algorithms based on list scheduling. Then we study a layer 
partitioning method. Our final algorithm is a combination of layer partitioning and 
the third heuristic. Two of the better algorithms are evaluated on the Raw machine. 
Results show that they are competitive with previously published results; for scientific 
codes, our heuristics can perform an average of 25% better. 



1 Introduction 

Spatial architectures are becoming increasingly important because they address the prob- 
lem that wire delays do not scale with technology [1]. Signals already take more than one 
cycle to cross a chip today, and the delay will only get worse in future technology. Spatial 
architectures address this problem by organizing their resources into replicated units on the 
chip. Communication within a unit takes one cycle, but communication between units incurs 
one or more additional cycles of delay. Examples of spatial architectures include clustered 
VLIWs, Raw [23], Grid processors [20], and ILDPs [11]. 

Instruction scheduling is an important optimization problem on spatial architectures. On 
these architectures, the instruction scheduler has to partition instructions across the comput- 
ing resources. While instruction schedulers on traditional architectures only need to assign 
instructions to time, on spatial architectures they need to assign instructions in both space 
and time. We call this combined scheduling problem space-time scheduling. 

Like most practical scheduling problems, combined instruction assignment and scheduling is 
known to be NP-complete [22]. Thus, in practice space-time schedulers are based on heuristics. 
To make partitioning decisions, a heuristic scheduler has to understand the proper tradeoff 
between parallelism and locality. Figure 1 shows an example of this tradeoff. Spatial scheduling 
by itself is already a more difficult problem than temporal scheduling, because a small spatial 
mistake is generally more costly than a small temporal mistake. If a critical instruction is 
scheduled one cycle later than desired, only one cycle is lost. But if a critical instruction is 
scheduled one unit of distance farther away than desired, cycles can be lost from unnecessary 
communication delays, additional communication resource contention, and increase in register 
pressure. In addition, some instructions on spatial architectures may have specific spatial 
requirements. These requirements arise from the need to access specific spatial resources, 
such as a specific memory bank [3]. A good scheduler must be sensitive to these constraints 
in order to generate a good schedule. 

Fig. 1. An example of the tradeoff between parallelism and locality on spatial architectures. 
Rectangles are instructions and edges between rectangles represent data dependences. The circles 
on the left represent the time axis. Consider an architecture with three clusters, each with one 
functional unit, where communication takes one cycle of latency plus one receive instruction. 
In (a), conservative partitioning that maximizes locality and minimizes communication leads to 
an eight-cycle schedule. In (b), aggressive partitioning leads to too much communication and 
also yields an eight-cycle schedule. The optimal schedule, shown in (c), takes only seven cycles, 
and it requires a careful tradeoff between locality and parallelism. 

This paper considers the space-time scheduling problem from both theoretical and practical 
perspectives. We present two formulations of the scheduling problem as integer linear 
programs (ILPs). The first formulation handles our general problem, but it uses a large number 
of integer variables. The second formulation applies when communication delays are small. 
We describe the LP-relaxation of the second formulation and derive a constant-factor 
approximation algorithm for the case of small communication delays. These formulations can be 
used to find optimal solutions for small graphs. The solutions can then be used as baselines 
for evaluating more practical heuristic approaches. Next, we present an 8-approximation 
algorithm whose running time is exponential in the number of tiles and the communication 
delays, but is polynomial if these parameters are constant. On the practical side, we present 
heuristics for estimating the minimum completion time of the subgraph rooted at each 
instruction. These estimates can be used as priority functions for existing algorithms based 
on list scheduling, and they are more accurate than existing heuristics. Then we present 
several algorithms based on these heuristics. Finally, we introduce a partitioning idea in 
which we partition the precedence graph into small layers. We implement these algorithms in 
Rawcc [13], the Raw instruction-level parallelizing compiler, and we show that the algorithms 
we evaluate compare favorably with Rawcc's original algorithm. 

The rest of the paper is organized as follows. Section 2 states the problem. Section 3 
presents related work. Section 4 describes the two integer programming formulations of the 
problem. Section 5 introduces an 8-approximation algorithm whose running time depends 
exponentially on the number of tiles and the communication delays, but is polynomial if these 
parameters are constant. Section 6 introduces several practical heuristic bounds and 
algorithms. Section 7.1 describes a layer partitioning method and our final algorithm, which 
is a combination of the previous heuristics and this layer partitioning idea. Section 8 
presents results of those algorithms on the Raw machine. Section 9 discusses future work and 
concludes. 

2 Problem Statement and Preliminaries 

This section defines the problem statement. We adopt the notation described in [7]. A scheduling 
problem is denoted by $\alpha|\beta|\gamma$. $\alpha$ denotes the machine environment, e.g., P for 
parallel identical machines; $\beta$ denotes various side constraints and characteristics, e.g., 
a list of precedence constraints prec; $\gamma$ denotes an optimality criterion. In our problem, 
$\gamma$ is $C_{\max}$, the maximum completion time, or makespan, of a set of instructions. 

We are given a set V of instructions and m clusters or tiles$^1$ on which these instructions 
should be executed. Each instruction i has processing time $p_i$. A set of edges E defines the 
precedence constraints in V. The graph G = (V, E) is a directed acyclic graph. For each 
precedence-constrained instruction pair $(i,j) \in E$ and pair of clusters (p, q), we define an 
associated non-negative delay $l_{i,j,p,q}$ to be the minimum difference between the scheduled 
times of i and j, if i is scheduled on tile p and j is scheduled on tile q. The output is a 
schedule mapping each instruction j to a tile q and a time t, such that for each edge 
$(i,j) \in E$, if i is scheduled on tile p, it is scheduled no later than time 
$t - l_{i,j,p,q}$. Our objective is to find a proper schedule with minimum makespan 
($P|prec; l_{i,j,p,q}|C_{\max}$). 

Our model allows instructions to be preplaced on a specific cluster. We define F to be a 
mapping from instructions to clusters. The domain of F is the set of preplaced instructions. 
Each mapping $i \to p$ in F specifies the constraint that instruction i must be mapped onto 
cluster p. 

We model $l_{i,j,p,q}$ as the sum of two orthogonal components, an instruction-dependent 
component and a cluster-dependent component: $l_{i,j,p,q} = delay(i,j) + com(p,q)$. The delay 
delay(i, j) can be used to model pipelined functional units; for this use its value only depends 
on i. com(p, q) is used to model the cost of communicating between clusters. In different 
parts of this paper, we will consider different variants of the problem with restrictions and 
generalizations. One special case that we are interested in is when communication delays are 
small compared to the length of the instructions; more precisely, communication delays are 
small if they are smaller than all instruction lengths. Most of our algorithms take preplaced 
instructions into account, but some do not. 
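
The constraints of this section can be made concrete with a small checker. The following is a minimal sketch, not the paper's implementation. It assumes that scheduled times are start times, that the precedence constraint is exactly start(j) - start(i) >= delay(i,j) + com(p,q), and that a tile executes one instruction at a time. The names `is_feasible`, `makespan`, `latency` and `hop`, and the example values, are hypothetical.

```python
from typing import Callable, Dict, Hashable, Iterable, Optional, Tuple

Instr = Hashable
Slot = Tuple[int, int]  # (tile, start time)


def is_feasible(
    schedule: Dict[Instr, Slot],
    edges: Iterable[Tuple[Instr, Instr]],
    proc_time: Dict[Instr, int],
    delay: Callable[[Instr, Instr], int],
    com: Callable[[int, int], int],
    preplaced: Optional[Dict[Instr, int]] = None,
) -> bool:
    """Check a schedule against the constraints of the problem statement."""
    # Preplacement: instruction i must sit on tile F(i).
    for i, tile in (preplaced or {}).items():
        if schedule[i][0] != tile:
            return False
    # Precedence: start(j) - start(i) >= delay(i, j) + com(p, q).
    for i, j in edges:
        p, t_i = schedule[i]
        q, t_j = schedule[j]
        if t_j - t_i < delay(i, j) + com(p, q):
            return False
    # Assumption: a tile executes one instruction at a time.
    busy: Dict[int, list] = {}
    for i, (tile, start) in schedule.items():
        busy.setdefault(tile, []).append((start, start + proc_time[i]))
    for intervals in busy.values():
        intervals.sort()
        for (_, end_a), (start_b, _) in zip(intervals, intervals[1:]):
            if start_b < end_a:
                return False
    return True


def makespan(schedule: Dict[Instr, Slot], proc_time: Dict[Instr, int]) -> int:
    """C_max: the latest completion time over all instructions."""
    return max(start + proc_time[i] for i, (_, start) in schedule.items())


def latency(i: Instr, j: Instr) -> int:
    return 1  # hypothetical pipeline delay, depends only on the producer


def hop(p: int, q: int) -> int:
    return 0 if p == q else 2  # hypothetical uniform communication delay


# Two unit-length instructions a -> b on two tiles.
proc = {"a": 1, "b": 1}
edges = [("a", "b")]
same_tile = {"a": (0, 0), "b": (0, 1)}   # no communication needed
too_early = {"a": (0, 0), "b": (1, 1)}   # ignores the communication delay
remote_ok = {"a": (0, 0), "b": (1, 3)}   # waits for delay + communication
```

In this example only `too_early` violates a constraint: moving b to the other tile requires waiting for the communication delay as well as the pipeline delay.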

Throughout the paper, m corresponds to the number of tiles, and $t_i$ corresponds to the 
starting time of instruction i. We will also use $p_i$ for instruction lengths, com(p, q) for 
the communication delay between tile p and tile q, and prec for the precedence graph. 

3 Related Work 

The most general problem for which a polynomial time algorithm is known was studied in [6]. 
If m and $l_{i,j,a,b}$ are constants, instructions are unit length, the precedence graph is a 
tree, and there are no preplaced instructions 
($Pm|tree; p_j = 1; l_{i,j,a,b} \in \{0, 1, \dots, D\}|C_{\max}$), the problem is solvable in 
polynomial time. When the precedence graph is a DAG instead of a tree, it is unknown whether 
the problem is NP-complete; in fact, a special case of this problem is the well-known 3-tile 
scheduling problem ($P3|prec; p_j = 1|C_{\max}$), whose complexity is open. Generalizing any 
other part of the above problem yields an NP-complete problem. 

Since most scheduling problems are NP-complete, approximation algorithms have been 
derived for them. In the presence of precedence delays, the best approximation factor for 
the problem $P|prec; delays|C_{\max}$ is 2 − ^ [19]. In the presence of communication delays, 
most studies have assumed that the delay is uniform: zero if two instructions are on the 
same cluster, and a constant c if two instructions are on different clusters. For c smaller than 
instruction execution times and no precedence (pipeline) delays, there is a |-approximation 
algorithm [17]. We will extend this algorithm to handle precedence delays as well. For larger 
communication delays, there is no known constant factor approximation algorithm. 

Many heuristic approaches have been developed for space-time scheduling. UAS (Unified 
Assign-and-Schedule) performs space-time scheduling on clustered VLIWs in a single step, 
using a greedy, list-scheduling-like algorithm [21]. Desoli describes an algorithm targeted at 
graphs with a large degree of symmetry [4]. Leupers describes an iterative combined approach 
to perform scheduling and partitioning on a VLIW DSP [15]. The approach is based on 
simulated annealing. 

However, there have been far fewer space-time scheduling algorithms that take into account 
preplaced instructions. One such algorithm is BUG (Bottom-Up-Greedy). BUG is implemented 
for ELI, one of the earliest spatial architectures that relies on the compiler for 
space-time scheduling [5]. BUG only performs space-scheduling; time-scheduling is done via 
traditional list scheduling. BUG is a two-phase algorithm: the algorithm first traverses a 
dependence graph bottom-up to propagate information about preplaced memory instructions. 
Then, it traverses the graph top-down and greedily maps each instruction to the cluster that 
can execute it earliest. The Multiflow compiler uses a variant of BUG [16], but it does not 
account for preplaced instructions. Lee also handles preplaced instructions [13]. He borrows 
his general approach from multiprocessor task graph scheduling [13]. Like Ellis, Lee uses a 
separate list scheduler to perform time-scheduling. Space-scheduling is performed in three 
steps. Clustering groups together instructions that have little parallelism; merging reduces 
the number of clusters through merging; placement maps clusters to tiles. Constraints from 
preplaced instructions are mainly handled during placement. 



4 ILP Based Methods 

In this section, we study two kinds of integer linear programming formulation for our problem. 
One is a naive approach for formalizing our general problem; the other is a formulation for 
a special case that is useful to derive an approximation algorithm. The advantage of the 
first formulation is that it works for the general case, but the size of the ILP is large. The 
second formulation is for a special case, but its size is much smaller, and we can derive an 
approximation algorithm from its LP-relaxation. 

4.1 Time-Indexed and Interval-Indexed Integer Programming 

This section presents integer programming formulations for our general problem. The 
formulations are generalizations of the Time-Indexed and Interval-Indexed integer programming 
formulations for the $P|prec|\sum w_i C_i$ problem studied in [9] and [10], extended to account 
for communication and precedence delays. 

The idea is to associate a zero-one variable with each triple of instruction, tile, and time. 
Each such variable indicates whether an instruction is completed on the specific cluster at 
the specific time. This approach is called Time-Indexed Linear Programming. Here, variable 
$y_{ijt}$ is equal to 1 if instruction i is completed on cluster j at time t. Variable $x_{ia}$ 
is equal to 0 if instruction i is scheduled on cluster a and 1 otherwise. We use T to represent 
the makespan ($C_{\max}$), and M is an arbitrary, conservative upper bound on T. 



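If the $y_{ijt}$'s are zero-one variables, $\sum_{j=1}^{m}\sum_{t=1}^{M} t\, y_{ijt}$ is exactly 
equal to the completion time of instruction i. Thus, for each instruction i, the completion 
time of instruction i must be at most T, i.e., 

$$\forall\, 1 \le i \le n: \quad T \ge \sum_{j=1}^{m}\sum_{t=1}^{M} t\, y_{ijt}$$

Each instruction must be completed at a unique time and before makespan T, thus 

$$\forall\, 1 \le i \le n: \quad \sum_{j=1}^{m}\sum_{t=1}^{M} y_{ijt} = 1$$

From their definitions, $x_{ia}$ and $y_{iat}$ are related by the following: 

$$\forall\, 1 \le i \le n,\ 1 \le a \le m: \quad \sum_{t=1}^{M} y_{iat} = 1 - x_{ia}$$

If $(i,j) \in E$ and instruction i is completed on cluster a at time t, then the earliest time 
instruction j can be completed on cluster b is $t + p_j + l_{ijab}$: 

$$\sum_{s=1}^{t} y_{ias} + x_{ia} \ge \sum_{s=1}^{t+p_j+l_{ijab}} y_{jbs}$$

At each time $1 \le t \le T$, at most one instruction can be in execution on each cluster j, 
thus 

$$\sum_{i=1}^{n}\sum_{s=t}^{t+p_i-1} y_{ijs} \le 1$$

Therefore we have the following integer program: 

$$\begin{aligned}
\text{minimize} \quad & T \\
\text{subject to} \quad
& \forall\, 1 \le i \le n: && T \ge \sum_{j=1}^{m}\sum_{t=1}^{M} t\, y_{ijt} \\
& \forall\, 1 \le i \le n: && \sum_{j=1}^{m}\sum_{t=1}^{M} y_{ijt} = 1 \\
& \forall\, 1 \le i \le n,\ 1 \le a \le m: && \sum_{t=1}^{M} y_{iat} = 1 - x_{ia} \\
& \forall\, (i,j) \in E,\ p_i \le t \le T - p_j - l_{ijab},\ 1 \le a, b \le m: && \sum_{s=1}^{t} y_{ias} + x_{ia} \ge \sum_{s=1}^{t+p_j+l_{ijab}} y_{jbs} \\
& \forall\, 1 \le j \le m,\ 1 \le t \le T: && \sum_{i=1}^{n}\sum_{s=t}^{t+p_i-1} y_{ijs} \le 1 \\
& \forall\, i, j, t, a: && y_{ijt} \in \{0,1\},\ x_{ia} \in \{0,1\}
\end{aligned}$$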

This formulation can easily be extended to handle the preplaced instruction constraints 
in F. In order to do so, for an instruction i that is preplaced on tile k, the following 
equality holds: $\sum_{t=1}^{M} y_{ikt} = 1$. 

The problem with the above ILP formulation is that the number of variables and inequalities 
is large. In order to decrease the number of variables and constraints, we use the 
interval-indexed formulation [10]. Define interval $l$ to be the interval 
$[(1+\epsilon)^{l-1}, (1+\epsilon)^{l}]$ and $\tau_l = (1+\epsilon)^{l}$. In the interval ILP 
formulation, variable $x_{ijl}$ is 1 if instruction i is completed on cluster j in interval $l$. 
With this formulation, we can trade off the size of the LP against the optimality of the 
output schedule. For example, by setting $\epsilon = 1$, the number of variables is 
$mn \lg(T)$ instead of $mnT$, and the algorithm guarantees that the schedule it finds will be 
within a factor of two of optimal. 
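
To give a feel for the size reduction, the following sketch counts the variables of both formulations. It only illustrates the arithmetic and assumes one zero-one variable per (instruction, tile, time) triple or per (instruction, tile, interval) triple. The function names and the sample sizes are hypothetical.

```python
from typing import List, Tuple


def interval_endpoints(T: int, eps: float) -> List[float]:
    """Endpoints (1+eps)^l for l = 0, 1, ... until the horizon T is covered."""
    endpoints = [1.0]
    while endpoints[-1] < T:
        endpoints.append(endpoints[-1] * (1.0 + eps))
    return endpoints


def variable_counts(n: int, m: int, T: int, eps: float) -> Tuple[int, int]:
    """(time-indexed count, interval-indexed count) for n instructions, m tiles."""
    num_intervals = len(interval_endpoints(T, eps)) - 1
    return n * m * T, n * m * num_intervals


# 50 instructions, 4 tiles, horizon 1024, eps = 1 (intervals double in length).
time_indexed, interval_indexed = variable_counts(50, 4, 1024, 1.0)
```

With eps = 1 the intervals are [1, 2], [2, 4], [4, 8], and so on, so the number of intervals grows with lg(T) rather than T.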



4.2 Approximation Algorithm for Small Communication Delays 

This section presents an approximation algorithm for clusters with uniform communication 
delays, where the maximum communication delay is less than the minimum instruction length. 
We denote this problem as $P|prec; p_i; l_{ijab}; com[i,j] = r; small\ r|C_{\max}$. 

Our algorithm is based on the | approximation algorithm presented in [18] for the schedul- 
ing problem without precedence delays (delay(i,j)) and with small communication delays 
(com(a, b)). Their method is based on an LP-relaxation of the problem followed by an LP- 
rounding method. We first extend their result for unbounded number of clusters. Then, using 
the order of instructions in this schedule for unbounded number of clusters, we extend the 
previous |-approximation algorithm for bounded number of clusters in the presence of com- 
munication delay. We present the algorithm and sketch the proof of correctness for this more 
general case. 

First we model the $P\infty|prec; p_i; l_{ijab}; com[i,j] = r; small\ r|C_{\max}$ problem with 
an integer program. The term $P\infty$ means there is no constraint on the number of tiles. In 
the following integer linear program, T corresponds to the makespan ($C_{\max}$), the $t_i$'s 
correspond to the starting times of the instructions, and the $x_{ij}$'s indicate whether 
instruction j is scheduled on the same cluster as instruction i, with 0 representing yes and 1 
representing no. As before, $p_i$ is the length of instruction i. $\Gamma^+(i)$ is the set of 
successors of node i, and $\Gamma^-(i)$ is the set of predecessors of node i. 

The makespan is at least the completion time of every instruction, thus for each instruction i, 
$t_i + p_i \le T$. 

The fact that all communication delays are small, along with the fact that at most one 
successor of i is scheduled on the same tile as i immediately after i, implies that for at 
most one successor j of i we have $t_i + p_i + delay_{ij} + r > t_j$, and for the others 
$t_i + p_i + delay_{ij} + r \le t_j$. Note that this works when communication delays are small, 
because in this case, after completion of the instruction that is scheduled immediately after 
i, all the other successors of i can be scheduled on the same tile as i, because their 
communication delay is smaller than the length of instruction j. Using $x_{ij}$, we can capture 
both cases by the following inequality: $t_i + p_i + delay_{ij} + x_{ij} r \le t_j$. Similarly, 
there is at most one predecessor of i that is scheduled on the same tile as i immediately 
before i. These two facts are captured by these inequalities: for all $i \in V$, 
$\sum_{j \in \Gamma^+(i)} x_{ij} \ge |\Gamma^+(i)| - 1$ and 
$\sum_{j \in \Gamma^-(i)} x_{ji} \ge |\Gamma^-(i)| - 1$ (for more details please refer to [18]). 
We relax the integer constraints $x_{ij} \in \{0,1\}$ to come up with the following linear 
program: 

$$\begin{aligned}
\text{minimize} \quad & T \\
\text{subject to} \quad
& \forall\, i \in V: && t_i + p_i \le T \\
& \forall\, i \in V: && t_i \ge 0 \\
& \forall\, (i,j) \in E(G): && t_i + p_i + delay_{ij} + x_{ij} r \le t_j \\
& \forall\, i \in V: && \sum_{j \in \Gamma^+(i)} x_{ij} \ge |\Gamma^+(i)| - 1 \\
& \forall\, i \in V: && \sum_{j \in \Gamma^-(i)} x_{ji} \ge |\Gamma^-(i)| - 1 \\
& \forall\, 1 \le i, j \le |V|: && 0 \le x_{ij} \le 1
\end{aligned}$$



Now suppose the optimal solution of the above linear program is $t^{opt}, T^{opt}, x^{opt}$. 
From this solution, we compute integer values $\alpha_{ij}$ in this way: if 
$x^{opt}_{ij} < 1/2$, then $\alpha_{ij} = 0$, otherwise $\alpha_{ij} = 1$. It is easy to see 
that for each i the number of zero $\alpha_{ij}$'s is at most 1. We call j the favored 
successor of i iff $\alpha_{ij} = 0$. 

The following list scheduling heuristic algorithm uses this optimal solution to schedule 
instructions. Let $t^h_i$ denote the starting time of instruction i determined by this list 
scheduling. 

Suppose R is the set of all available instructions at the current time. The algorithm advances 
time, and at each time it decides which instructions are scheduled on which clusters. 

- For current time $\delta = 1$ to T do: 

1. Let S be the subset of R that can be processed at time $\delta$. 

2. Let $S_1$ and $S_2$ be subsets of R such that $S = S_1 \cup S_2$: 

  • $S_1$: $i \in S_1 \Leftrightarrow \forall j \in \Gamma^-(i): t^h_j < \delta - 1$. All 
    instructions $i \in S_1$ can be executed at time $\delta$ on new clusters. 

  • $S_2$: $i \in S_2 \Leftrightarrow \exists!\, k \in \Gamma^-(i): t^h_k = \delta - 1$ and 
    $\forall j \in \Gamma^-(i),\ j \ne k: t^h_j < \delta - 1$. In this case we set 
    prec(i) := k. 

Schedule each instruction in $S_1$ at time $\delta$ on a different new cluster. For the 
instructions in $S_2$, if prec($i_1$) = prec($i_2$) = ... = prec($i_j$) = k, choose $i_a$ such 
that $x^{opt}_{k i_a}$ is minimum and schedule $i_a$ on the same cluster as k. 

We claim that the output of the above scheduling algorithm is at most a factor of | of 
the optimum solution. 

The proof of above claim is based on the following lemma, which we state without proof: 

Lemma 1. For all (i,j) G E(G): pi + dij + ctijT < |(p, + dij + x°J Cy). 

Based on above lemma and with an induction argument and using the fact that the solution 
for linear program is a lower bound on the solution for integer program, we can prove the 
following theorem. 2 

Theorem 1. There exists a ^-approximation algorithm for the following problem, 
Poo\prec;pi;l ija b; com[i, j] = r; small r|C max . 

Now, in order to solve the problem $P|prec; p_i; l_{ijab}; com[i,j] = c\ small|C_{\max}$, we 
first solve the problem with an unbounded number of clusters: 
$P\infty|prec; p_i; l_{ijab}; com[i,j] = c\ small|C_{\max}$. We call a successor j of i the 
favored successor if, in the schedule with an unbounded number of clusters, j is scheduled on 
the same cluster as i immediately after i. As before, we set $\alpha_{ij} = 0$ if j is the 
favored successor of i and $\alpha_{ij} = 1$ otherwise. We use a variant of Graham's 
list-scheduling rule that takes communication delays into account. Suppose the completion 
time of instruction i in the schedule produced by the list scheduling algorithm is denoted by 
Cf. We define instruction j to be available at time t if t > Cf + d^ + (1 — ohj)t. Then the 
algorithm proceeds as follows: at each time we choose m arbitrary available instructions to 
be scheduled on these m clusters. 

Using above theorem and the fact that the solution for unbounded number of clusters is 
a lower bound on the solution for m clusters (similar to the method in [18]) we can prove 
the approximation factor of the above algorithm is at most |. Thus we have the following 
theorem: 

Theorem 2. There exists a ^-approximation algorithm for the following problem, P\prec;pi\ Uj a b\ com[i, j] = r; small 

2 Due to space constraints, we omit the proofs. Interested readers can refer to the similar proofs 
in [18]. 



5 Constant Factor Approximation 

This section presents a polynomial time constant factor approximation algorithm for the 
case of constant communication delays com(i,j) = p and a constant number of tiles. Although 
the running time is exponential in p, the advantage of this algorithm is that the 
approximation factor does not depend on p. We present this algorithm for the case of unit 
execution times. 

Definition 1. For all $v \in V(G)$, height(v), the length of the longest path from a source 
to v, is defined inductively as follows: if indegree(v) = 0, height(v) = 1. Otherwise, if 
$v_1, v_2, \dots, v_k$ are the predecessors of v: 
$height(v) = \max_{1 \le i \le k}(height(v_i) + 1)$. 
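
The inductive definition of height can be evaluated in one pass over a topological order. The following is a minimal sketch assuming the precedence graph is given as a node list and an edge list; the function name `heights` is hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Hashable, Iterable, List, Tuple


def heights(
    nodes: List[Hashable], edges: Iterable[Tuple[Hashable, Hashable]]
) -> Dict[Hashable, int]:
    """height(v) = 1 for sources, else 1 + max height over predecessors."""
    indegree = {v: 0 for v in nodes}
    successors = defaultdict(list)
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    height = {v: 1 for v in nodes}
    ready = deque(v for v in nodes if indegree[v] == 0)
    while ready:  # Kahn-style topological traversal
        u = ready.popleft()
        for v in successors[u]:
            height[v] = max(height[v], height[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return height


# a -> b -> c, a -> c, d -> c
example = heights(
    ["a", "b", "c", "d"],
    [("a", "b"), ("b", "c"), ("a", "c"), ("d", "c")],
)
```

In the example, c has height 3 because its longest incoming chain is a, b, c.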

An edge $(u,v) \in E(G)$ in the precedence graph is called saturated iff height(v) = kp for 
some integer k. A scheduling of instructions on m tiles is called a saturated scheduling iff, 
for any saturated edge (u,v), the starting time of instruction v is at least the completion 
time of u plus p. 

Theorem 3. If the makespan of the optimum scheduling is T and the makespan of the optimum 
saturated scheduling is T', then $T' \le 2T$. 

Proof. Consider the optimum scheduling OPT in which the makespan equals T and the 
starting time of instruction i is $t_i$. We construct a saturated scheduling for which the 
starting time of instruction i, namely $t'_i$, is at most $2t_i$. $t'_i$ is computed as 
follows: $t'_i = t_i + \lfloor height(i)/p \rfloor\, p$. 

This new solution is saturated, because for a saturated edge (i,j), we know that 
$t_j \ge t_i + 1$. Now, height(j) = kp implies $\lfloor height(i)/p \rfloor \le k - 1$, so 
$t'_j = t_j + kp \ge 1 + t_i + (k-1)p + p \ge 1 + t'_i + p$. It is clear that the $t'_i$'s form 
a feasible solution. Using the fact that $t_i \ge height(i)$, it turns out that 
$t'_i \le t_i + height(i) \le 2t_i$, thus the makespan of this new solution is at most 2T. 
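
The construction in the proof can be tried on a small instance. The sketch below assumes unit execution times and start times beginning at 1, and reads the shift as floor(height(i)/p) times p. It checks only the saturation condition and the bound of twice the original start time; it does not check whether shifted instructions collide on a tile. All function names are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def heights(nodes: List[Hashable], edges: List[Edge]) -> Dict[Hashable, int]:
    """height(v) = 1 for sources, else 1 + max height over predecessors."""
    preds: Dict[Hashable, list] = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
    memo: Dict[Hashable, int] = {}

    def h(v: Hashable) -> int:
        if v not in memo:
            memo[v] = 1 + max((h(u) for u in preds[v]), default=0)
        return memo[v]

    return {v: h(v) for v in nodes}


def saturate(start: Dict[Hashable, int], height: Dict[Hashable, int], p: int):
    """Shift each start time by floor(height / p) * p."""
    return {v: start[v] + (height[v] // p) * p for v in start}


def is_saturated(start, edges: List[Edge], height, p: int) -> bool:
    """Unit execution: v must start at least p after u completes (start + 1)
    on every edge (u, v) whose head has a height that is a multiple of p."""
    return all(
        start[v] >= start[u] + 1 + p for u, v in edges if height[v] % p == 0
    )


# Chain a -> b -> c -> d, run back to back, with communication delay p = 2.
nodes = ["a", "b", "c", "d"]
chain = [("a", "b"), ("b", "c"), ("c", "d")]
original = {"a": 1, "b": 2, "c": 3, "d": 4}
hts = heights(nodes, chain)
shifted = saturate(original, hts, 2)
```

Here the saturated edges are those entering b and d, and the shifted start times are 1, 4, 5 and 8, each at most twice the original.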

The above theorem implies that it is sufficient to design a constant factor approximation 
algorithm for the optimum saturated scheduling problem. After removing all saturated edges, 
the graph is partitioned into layers, in each of which the longest path from any source 
node to any leaf is at most p. Scheduling each layer separately and putting p empty slots 
between them yields a saturated scheduling. In order to schedule each layer, we observe that 
the longest path between any source and leaf node is at most p. Thus, after putting a delay 
of length p on each edge, the length of the scheduling increases by at most $p \cdot p = p^2$. 
If the number of instructions is less than $mp^2$, then the optimum scheduling can be found 
using a brute-force method (for example, the integer programming formulation). Otherwise, 
the optimum makespan is at least $n/m \ge p^2$, thus adding $p^2$ to the makespan at most 
doubles it, and the result is still a 2-approximation. Scheduling this new precedence graph 
with a delay of p on each edge is easy because there is no communication delay, and we can 
use Graham's list scheduling to get a 2-approximation algorithm. From this discussion, 
we have the following theorem: 

Theorem 4. If p and m are bounded by constants, there exists an 8-approximation algorithm 
for the instruction scheduling problem in the case of uniform communication delays, 
$Pm|prec; delays\ com_{ij} = p|C_{\max}$. 

Proof. From above discussion, we can easily prove the approximation factor 8. In fact, we 
lose a factor of 2 in Theorem 3, a factor of 2 by adding p 2 to the makespan between different 
layers, and a factor of 2 from using list scheduling. 



6 Heuristic Algorithms 

In this section, we give several heuristic bounds and algorithms to solve our general problem. 
In subsection 6.1, we describe several bounds that are useful as priority functions in 
scheduling. In subsection 6.2, we describe some algorithms using these bounds. In section 7.1, 
we will introduce a method for partitioning the precedence graph and show that, by combining 
all these ideas, we can design an algorithm that has the advantages of all these heuristics. 
In other words, using several ideas from the theoretical point of view, we design an algorithm 
whose performance and running time are reasonable. While describing the algorithms, we study 
their theoretical justification in terms of worst-case performance guarantees. 

6.1 Heuristic Bounds 

In this section, we design recursive algorithms to find the following: lower bounds or an 
estimation for minimum remaining time of scheduling after starting one instruction; and the 
earliest time that one instruction can be scheduled in any proper scheduling. We will use 
these bounds to design heuristic algorithms. 
We first define some notations. 

Definition 2. Given a precedence graph G(V,E), for a node $v \in V(G)$, the children of v, 
namely CH(v), the subgraph above v, namely GA(v), and the subgraph below v, namely GB(v), are 
defined as follows: 

- $CH(v) = \{u \in V(G) \mid (v,u) \in E(G)\}$. 

- $GA(v) = \{u \in V(G) \mid \text{there is a directed path from } u \text{ to } v \text{ in } G\}$. 

- $GB(v) = \{u \in V(G) \mid \text{there is a directed path from } v \text{ to } u \text{ in } G\}$. 
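
These three sets can be computed directly from an edge list. The following is a minimal sketch with hypothetical function names. It assumes that directed paths have at least one edge, so v itself belongs to neither GA(v) nor GB(v).

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def children(v: Hashable, edges: Iterable[Edge]) -> Set[Hashable]:
    """CH(v): heads of the edges leaving v."""
    return {w for u, w in edges if u == v}


def _reachable(v: Hashable, adjacency: Dict[Hashable, Set[Hashable]]) -> Set[Hashable]:
    """Nodes reachable from v by at least one step of the adjacency map."""
    seen: Set[Hashable] = set()
    stack = [v]
    while stack:
        for w in adjacency.get(stack.pop(), ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def subgraph_below(v: Hashable, edges: List[Edge]) -> Set[Hashable]:
    """GB(v): nodes u with a directed path from v to u."""
    succ: Dict[Hashable, Set[Hashable]] = {}
    for u, w in edges:
        succ.setdefault(u, set()).add(w)
    return _reachable(v, succ)


def subgraph_above(v: Hashable, edges: List[Edge]) -> Set[Hashable]:
    """GA(v): nodes u with a directed path from u to v."""
    pred: Dict[Hashable, Set[Hashable]] = {}
    for u, w in edges:
        pred.setdefault(w, set()).add(u)
    return _reachable(v, pred)


sample_edges = [("a", "b"), ("b", "c"), ("a", "d")]
```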

In order to clarify the algorithms, we will show their results on some sample precedence 
graphs. Three sample graphs are depicted in Figure 2. We also consider a network of two tiles 
with communication delay 5 between them. In $T_1$, $T_2$ and $T_3$ all instructions are unit 
length, all pipeline delays (delay(u,v)) are 2, and the only preplaced node is the bottom node 
in $T_3$. $T_1$ is called the fork graph and $T_2$ and $T_3$ are called join graphs. 

Fig. 2. Sample precedence graphs. 



First, we describe a simple bound that works only for a symmetric network of tiles. Then 
we extend this bound so that it works for general networks of tiles, where the bound for each 
tile is computed separately and it is suitable to handle preplaced instructions as well. 



A Bound for Symmetric Networks First we define the following property: 

Definition 3. Let $C_i$ be the set of communication delays from a tile i to the other tiles. A 
symmetric network is a network of tiles satisfying the following property: for each pair of 
tiles a, b, $C_a = C_b$. Examples of symmetric networks include cycles and complete graphs. 

Given an instance of the problem for a symmetric network, our purpose is to find a lower 
bound on the remaining time of the scheduling after scheduling one instruction. In other 
words, we want to find a function $mintime : V \to \mathbb{N}$ such that in any schedule, if v 
is started at time $t_v$, the scheduling cannot be completed before time 
$mintime(v) + t_v + p(v)$. 

This bound mintime(v) is simply equal to p(v) if v is a leaf. Otherwise, assume we have 
computed the mintime value for all children of node v, say $v_1, v_2, \dots, v_k$. Now, assume 
we schedule v at time t on machine 1. The constraints on the rest of the scheduling are as 
follows: 1) all $v_i$'s should be scheduled after time $t + p_v$, 2) instruction $v_j$ can be 
scheduled on machine i after time $t + delay(v,v_j) + com(1,i) + p_v$, and 3) if instruction 
$v_i$ is completed at time $C_i$, the scheduling may not be completed before time 
$C_i + mintime(v_i)$. Now consider $D_j = mintime(v_j)$ as the delivery time and 
$r_{ji} = delay(v,v_j) + com(1,i) + p_v$ as the release time of instruction j on machine i. 
From the above discussion, mintime(v) is at least the minimum solution to the following 
problem: $Q|p_j; r_{ji}|\max(C_j + D_j)$ (Q for non-identical processors, $r_{ji}$ for the 
release date of j on machine i, and the problem is to minimize completion time plus delivery 
time). In general, this problem is NP-complete (by a simple reduction from 
PARTITIONING [14]). Our next purpose is to solve the problem for special cases. In the 
following, we present a greedy algorithm for the case in which the $p_i$'s (instruction 
lengths) are equal and $delay(v,v_i) = delay(v,v_j)$ for all $1 \le i, j \le k$, as in 
pipeline delays. 

Let $C = \{c_1, c_2, \dots, c_m\}$ be the set of m communication delays from a cluster to all 
other clusters. For convenience, we use p(v) instead of $p_v$ for the length of instruction v. 
Let small(C) denote the smallest member of the set C. 

Algorithm GSM: Greedy algorithm for finding Simple Mintime 

Input:  A precedence graph G. 
        A network of tiles with communication delays between them. 
        Pipeline delays between instructions. 
        A node v. 
Output: mintime(v). 
begin 
    if v is a leaf, mintime(v) = p(v) and return 
    let v_1, ..., v_k be the children of node v (CH(v)), ordered in decreasing value of 
        mintime(v_i) + delay(v, v_i) 
    // We compute O_1, ..., O_k and then mintime(v) as follows: 
    let C = {c_1, c_2, ..., c_m} = the set of m communication delays from a cluster to all 
        other clusters 
    for i = 1 to k do 
        let c_t = small(C) 
        O_i = mintime(v_i) + delay(v, v_i) + p(v) + c_t 
        c_t = c_t + p(v_i)   (update C) 
    mintime(v) = max(O_1, O_2, ..., O_k) 
end 
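
The pseudocode above can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation. It assumes that the set C contains a 0 entry for the tile on which v itself runs, so that C = {0, 5} in the two-tile example, and it computes the mintime of each child recursively with memoization. The names `gsm_mintime`, `fork` and `pipeline_delay` are hypothetical.

```python
import heapq
from typing import Callable, Dict, Hashable, List, Optional


def gsm_mintime(
    v: Hashable,
    children: Dict[Hashable, List[Hashable]],
    p: Dict[Hashable, int],
    delay: Callable[[Hashable, Hashable], int],
    com_delays: List[int],
    memo: Optional[Dict[Hashable, int]] = None,
) -> int:
    """Greedy Simple Mintime bound for node v."""
    if memo is None:
        memo = {}
    if v in memo:
        return memo[v]
    kids = children.get(v, [])
    if not kids:
        memo[v] = p[v]  # leaf: mintime(v) = p(v)
        return memo[v]
    for u in kids:
        gsm_mintime(u, children, p, delay, com_delays, memo)
    # Children in decreasing order of mintime(v_i) + delay(v, v_i).
    ordered = sorted(kids, key=lambda u: memo[u] + delay(v, u), reverse=True)
    offsets = list(com_delays)  # the set C, one entry per cluster
    heapq.heapify(offsets)
    best = 0
    for u in ordered:
        c_t = heapq.heappop(offsets)              # c_t = small(C)
        best = max(best, memo[u] + delay(v, u) + p[v] + c_t)
        heapq.heappush(offsets, c_t + p[u])       # update C
    memo[v] = best
    return best


def fork(k: int) -> int:
    """mintime of the root of a fork graph with k unit-length leaves,
    pipeline delay 2, and two tiles with communication delay 5."""
    kids = ["leaf%d" % i for i in range(k)]
    lengths = {name: 1 for name in kids}
    lengths["root"] = 1

    def pipeline_delay(u: Hashable, w: Hashable) -> int:
        return 2

    return gsm_mintime("root", {"root": kids}, lengths, pipeline_delay, [0, 5])
```

A min-heap plays the role of small(C): each child is placed on the cluster that currently becomes free earliest, and that cluster's entry then grows by the child's length.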

The result of this algorithm for graph $T_1$ is: $mintime(v_i) = p(v_i) = 1$ and 

$$mintime(v) = \begin{cases} k + 3 & \text{if } k \le 5 \\ \left\lceil \frac{k-5}{2} \right\rceil + 8 & \text{if } k > 5 \end{cases}$$

and the result for T\ is clearly mintime(w) = 1 and mintime^) = 3. 
Lemma 2. If the $p_i$'s (instruction lengths) are the same and $delay(v,v_i) = delay(v,v_j)$ 
for all $1 \le i, j \le k$, then GSM gives a lower bound on the remaining time after 
scheduling instruction v. 

Proof. The output of the above algorithm is a scheduling of all instructions 
$v_1, \dots, v_k$, say OUTPUT. From the above discussion, it is sufficient to prove that the 
output is the solution of the following minimization problem: 
$Q|r_{ji}; p_i|\max(C_i + mintime(i))$. Suppose for contradiction that in the optimum 
solution, OPT, instructions are scheduled in a different order. Then, consider the first 
place where OPT differs from OUTPUT, in which instruction j is scheduled on machine i in 
OUTPUT and instruction j' is scheduled on machine i in OPT. It is not hard to see that 
switching instructions j and j' in OPT yields another feasible scheduling, because pipeline 
delays are the same and release times of different instructions are the same. The cost of 
the solution is not increased, because this is the first place in which OPT and OUTPUT 
differ and the instructions are sorted in decreasing order of their mintime value in OUTPUT. 
This proves that we can modify the solution OPT without increasing its cost until we get 
OUTPUT. 

This bound is more accurate than the longest path bound as a priority function to schedule 
instructions because it considers the width of the subgraph below a node i in the precedence 
graph in addition to its height. We will see that the list scheduling algorithm with this priority 
function is in fact a good approximation algorithm in several special cases. 

More General Bounds There are two problems with the simple bound above. Firstly, it
can be used only on symmetric networks. Secondly, it does not take into account preplaced
instructions. Preplaced instructions are instructions that must be placed on a specific tile.
They arise when an instruction needs to access a specific resource that is only available on one
tile, e.g., the memory bank of a specific tile. These instructions force other nearby instructions
to be scheduled on the same tile as well. In order to capture these more general problems and
take into account the different remaining times for different tiles, we need to compute the
mintime separately for each potential tile. We therefore want a tile-sensitive lower bound
(estimate) mintime(v, j) for instruction v and cluster j, as follows:

Definition 4. For an instruction i and tile j, mintime(i, j) is a lower bound on the minimum
remaining time of scheduling after starting instruction i on cluster j.

Similar to the previous discussion for the simple bound, it turns out that the mintime value
can be captured as the optimum solution for the following scheduling problem:
Q | r_{vj}, p_v | max(C_v + D_{vj}), where the delivery times are D_{vj} = mintime(v, j) and the
release dates r_{vj} come from communication and pipeline delays (a precise discussion is in the
previous section, and we will see more details in the following algorithms). Assuming the number
of tiles and the out-degree of all vertices in the precedence graph are bounded by a constant,
we describe how to find this lower bound in general.

Algorithm RTM: Recursive algorithm to find Tile-sensitive Mintime

— If v is a leaf, for all j, mintime(v, j) = p(v). If v is preplaced on cluster k, for all clusters
j other than k, mintime(v, j) = ∞.

— Otherwise, let v_1, ..., v_k be the successors of node v. Again, if v is preplaced on tile k,
for all clusters j other than k, mintime(v, j) = ∞; otherwise we compute mintime(v, j)
as follows:

• We consider all different tile assignments of v_1, v_2, ..., v_k, and for each assignment
A = (j_1, j_2, ..., j_k) we compute value(A), i.e., the minimum remaining time of v
with this assignment, as follows:

1. For each tile p, let S_p = {u_1, ..., u_t} be the set of instructions assigned to p. Now
we compute the minimum value of max_i (C_i + mintime(u_i, p)), where C_i is the completion
time of instruction u_i in a scheduling of instructions {u_1, ..., u_t} on cluster p
such that instruction u_i has a release date of com(j, p) + delay(v, u_i). (In fact,
this problem can be formalized as 1 | r_i, p_i | max(C_i + w_i), and we can solve it by
assuming t, k and m are constants.) Let O_p be the output of this minimization
problem.

• value(A) = max_{i ∈ {1, ..., m}} O_i

— mintime(v, j) = min over all assignments A of value(A)

Notice that the above algorithm is polynomial time only if the out-degree of every node in
the precedence graph and the number of processors are constants. For sample graphs T_1 and
T_2, the result of RTM is the same as GSM. For T_3, mintime(v, 1) = 1, mintime(v, 2) = ∞,
mintime(v_i, 1) = 4 and mintime(v_i, 2) = 9.
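The enumeration in RTM can be sketched by brute force for very small graphs. The following Python sketch is illustrative only and rests on assumptions that the text does not fix: the names `Problem`, `one_tile` and `rtm_mintime` are hypothetical, the length p(v) is added to each successor's release date, and the tail of each successor is measured from its start time (in line with step 7 of GSM) rather than from its completion time.

```python
from dataclasses import dataclass, field
from itertools import permutations, product
from math import inf


@dataclass
class Problem:
    succ: dict      # node -> list of successors
    length: dict    # node -> p(v)
    delay: dict     # (u, v) -> pipeline delay
    com: list       # com[a][b] = communication delay from tile a to tile b
    preplaced: dict = field(default_factory=dict)  # node -> required tile


def one_tile(jobs):
    """Brute-force a one-machine problem: minimise max(start + tail).

    jobs is a list of (release, length, tail); every order is tried.
    An empty job list costs 0.
    """
    best = inf
    for order in permutations(jobs):
        clock, worst = 0, 0
        for release, length, tail in order:
            start = max(clock, release)
            worst = max(worst, start + tail)
            clock = start + length
        best = min(best, worst)
    return best


def rtm_mintime(prob, v, j, memo=None):
    """Tile-sensitive mintime(v, j), exponential in out-degree and tile count."""
    memo = {} if memo is None else memo
    if (v, j) in memo:
        return memo[(v, j)]
    kids = prob.succ.get(v, [])
    if prob.preplaced.get(v, j) != j:      # v is preplaced on another tile
        answer = inf
    elif not kids:                         # v is a leaf
        answer = prob.length[v]
    else:
        tiles = range(len(prob.com))
        answer = inf
        for assign in product(tiles, repeat=len(kids)):   # every assignment A
            value = 0
            for q in tiles:                               # O_q for each tile q
                jobs = [
                    (prob.length[v] + prob.delay[(v, u)] + prob.com[j][q],
                     prob.length[u],
                     rtm_mintime(prob, u, q, memo))
                    for u, a in zip(kids, assign) if a == q
                ]
                value = max(value, one_tile(jobs))
            answer = min(answer, value)
    memo[(v, j)] = answer
    return answer
```

For a hypothetical two-tile network with communication delay 4, pipeline delay 2 and unit lengths, a root with two leaf children gets mintime 5 on either tile, the same value the greedy simple bound gives for that instance. A chain a → b with b preplaced on tile 0 gets 4 when a is on tile 0 and 8 when a is on tile 1.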

Earliest Scheduling Time on a Tile In addition to the preplaced instructions, a good
bound for the remaining execution time should take into account the tile placement of the
instructions that have already been scheduled. For this purpose, we define another bound,
earlytime(v, j): the earliest time that instruction v can be scheduled on
tile j given the instructions that have already been scheduled. This bound is computed in much
the same way as mintime(v, j), except that the recursive computation is done top-down instead of
bottom-up. For brevity, we omit the details. Similar to GSM there is an algorithm
called GSE, and similar to RTM there is an algorithm RTE, for computing earlytime instead
of mintime values. Notice that in order to find earlytime using RTE, it is necessary that the
in-degree and the number of processors are bounded by a constant.

The output of RTE and GSE for T_1 is the following: earlytime(v, j) = 1 and
earlytime(v_i, j) = 4 for 1 ≤ j ≤ 2.

Now, we describe another property of this bound which is useful for justifying the final 
algorithm. 

Definition 5. Let G_p be the induced subgraph of G on the vertices with earlytime value
not greater than p, i.e., V(G_p) = {v ∈ V(G) | earlytime(v) ≤ p}. Let earlytime(G) =
max{earlytime(v) | v ∈ V(G)}, and let G\G_p be the induced subgraph of G on the vertices
V(G) − V(G_p).

Theorem 5. For any precedence graph G, earlytime(G) ≥ earlytime(G\G_p) + p.

Proof. We prove the stronger fact that for all vertices v ∈ V(G) − V(G_p), earlytime_G(v) ≥
earlytime_{G\G_p}(v) + p. The proof is by induction. The claim is clear for all vertices without a
predecessor in G\G_p, because earlytime_G(v) > p and earlytime_{G\G_p}(v) = 0. For any other
vertex v, the earlytime values of v in G and in G\G_p are the optimum values of two minimization
problems of the form Q | r_i, p_i | max(C_i + D_i) in which all parameters are the same except the
release times. In one problem r_i = earlytime_G(v_i) and in the other r'_i = earlytime_{G\G_p}(v_i).
Using the induction hypothesis, one can see that r_i ≥ r'_i + p. Consider the optimum solution of
the problem for G. By starting all instructions p time slots sooner, we get a feasible solution of
the problem for G\G_p, because all release times in G\G_p are at least p units smaller than those
in G. Thus, the cost of the optimum solution for the r_i's is at least the cost of the optimum
solution for the r'_i's plus p. This shows that earlytime_G(v) ≥ earlytime_{G\G_p}(v) + p.
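The shifting step of the induction can be written out explicitly. The following is a sketch in which the start times s_i and the names OPT(r), OPT(r') for the two optimum values are notation introduced here, not in the text.

```latex
% s_i : start times in an optimum schedule for release times r_i (problem for G)
% r_i \ge r'_i + p by the induction hypothesis
\begin{align*}
  s'_i &:= s_i - p \;\ge\; r_i - p \;\ge\; r'_i
      && \text{(shifted schedule respects the release times } r'_i\text{)} \\
  \mathrm{OPT}(r') &\le \mathrm{OPT}(r) - p
      && \text{(every completion time drops by exactly } p\text{)} \\
  \mathrm{earlytime}_G(v) &= \mathrm{OPT}(r)
      \;\ge\; \mathrm{OPT}(r') + p
      \;=\; \mathrm{earlytime}_{G \setminus G_p}(v) + p .
\end{align*}
```

A uniform shift leaves the relative order and spacing of instructions unchanged, so precedence and machine constraints stay satisfied; only the release times need to be rechecked.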
6.2 Heuristic Algorithms Based on Above Bounds

In this section, we use the above bounds to design heuristic algorithms. 

A Simple Algorithm Using Simple Bound First, we define the following: 

Definition 6. An instruction v is available if all its predecessors have been scheduled. The 
availability time of instruction v on tile j, denoted by avtime(v,j), is the earliest time t 
that instruction v can be scheduled on tile j, given the existing state of the schedule and all 
precedence and communication delays. 

The first observation is that the simple mintime(v) is a good priority function for instructions
that can be scheduled at any time. When there are two or more candidates to be scheduled
in a time slot, we should select the one with the greater mintime(v). This idea leads us to the
following simple list scheduling algorithm. We visit each scheduling slot (tile, time) in time
order. If more than one instruction can be scheduled in that slot, we select the
one with the greatest mintime value. Thus, we have the following list scheduling algorithm:

Algorithm LSSM: List Scheduling algorithm using Simple Mintime

Input:

— A precedence graph G.

— A network of tiles with communication delays between them.

— Pipeline delays between instructions.

Output: A schedule of graph G.

— t = 0

— While there exists an unscheduled instruction do

• t = t + 1

• for all tiles i do

1. Let A be the set of unscheduled instructions that can be scheduled at time t on
tile number i. If A is not empty, let instruction k be the instruction with maximum
mintime value in the set A. Schedule k on cluster i at time t and update the schedule.

This algorithm works well when communication delays are small relative to execution 
time of the instructions. The well-known Lawler's algorithm [8] is a special case of LSSM. In 
fact for the case of out-tree precedence graphs, unit execution time and unit communication 
delays, it is proved that the output of Lawler's algorithm is at most m f^- plus the optimum 
solution [8]. It is not hard to extend that proof in the case of small communication delays. 
However, this algorithm does not work well when communication delays are larger and more
crucial, as in the Raw machine. In that setting it is often better not to schedule an instruction
in an empty slot on an inconvenient tile, but to wait and schedule it somewhat later on a more
appropriate tile. In particular, when there are preplaced instructions in the bottom part of the
precedence graph, their constraints do not propagate to the top part.

As for our sample graphs, LSSM works well for T_1. For T_2, one can see the problem of
this algorithm when k = 2. LSSM schedules both v_1 and v_2 and the bottom node cannot be
scheduled before time 8, whereas by scheduling v_1 and v_2 on tile 1 at times 1 and 2, the bottom
node can be scheduled at time 5 on tile 1. LSSM has the same problem for T_3 as well.
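The behaviour described for k = 2 can be reproduced with a small sketch of the LSSM loop. The code below is illustrative, not the authors' implementation: the function name `lssm`, the data layout, and the instance parameters (unit lengths, pipeline delay 2, two tiles 4 cycles apart, node names `v1`, `v2`, `b`) are assumptions. An instruction is taken to be schedulable on tile i at time t once every predecessor u satisfies start(u) + p(u) + delay + com ≤ t.

```python
def lssm(nodes, preds, length, delay, com, mintime):
    """List scheduling with the simple mintime priority.

    Returns {instruction: (tile, start)}; time slots start at 1.
    """
    free_at = [1] * len(com)              # first free slot of each tile
    schedule = {}
    t = 0
    while len(schedule) < len(nodes):
        t += 1
        for i in range(len(com)):
            if free_at[i] > t:
                continue                  # tile i is still busy at time t

            def ready(v):
                # every predecessor is scheduled and its result has reached tile i
                return all(
                    u in schedule
                    and schedule[u][1] + length[u] + delay[(u, v)]
                    + com[schedule[u][0]][i] <= t
                    for u in preds.get(v, [])
                )

            candidates = [v for v in nodes if v not in schedule and ready(v)]
            if candidates:
                k = max(candidates, key=lambda v: mintime[v])
                schedule[k] = (i, t)
                free_at[i] = t + length[k]
    return schedule


# Hypothetical two-parent instance in the spirit of the k = 2 discussion.
example = lssm(
    nodes=["v1", "v2", "b"],
    preds={"b": ["v1", "v2"]},
    length={"v1": 1, "v2": 1, "b": 1},
    delay={("v1", "b"): 2, ("v2", "b"): 2},
    com=[[0, 4], [4, 0]],
    mintime={"v1": 3, "v2": 3, "b": 1},
)
```

With these assumed parameters the greedy rule places `v1` and `v2` on different tiles at time 1, so `b` has to wait for one cross-tile transfer and starts at time 8.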
Bottom-up Constraint Propagation The above algorithm works well when communication
delay is small relative to the execution time of instructions. When communication delay
is larger than execution time, as in a spatial architecture, this algorithm works poorly.
Intuitively, the reason is that with such communication delays, even if an instruction can be
scheduled in an early slot, it is often better to schedule it in a later slot on a more convenient
tile.

In this section, we describe a scheduling algorithm based on mintime(v, j), i.e., the
tile-sensitive mintime (mintime and avtime are defined in Definitions 4 and 6).

Definition 7. Instruction v is convenient for tile j if, according to the current state of the
schedule, avtime(v, j) + mintime(v, j) ≤ avtime(v, i) + mintime(v, i) for all tiles 1 ≤ i ≤ m.

The algorithm is as follows: 

Algorithm LCTM: List scheduling of Convenient instructions using Tile-sensitive
Mintime

Input:

— A precedence graph G.

— A network of tiles with communication delays between them.

— Pipeline delays between instructions.

Output: A schedule of graph G.

— While there exists an unscheduled instruction do

• Update availability times of instructions on tiles.

• for all tiles i do

1. Let A be the set of unscheduled instructions that are convenient for tile
number i. Let instruction k be the instruction with maximum mintime value in
the set A. Schedule k on cluster i at time avtime(k, i) and update the schedule.

Note that as the algorithm proceeds there might be a situation in which there is an instruc- 
tion v that is available but not convenient for tile j, but after scheduling other instructions 
on the other tiles, v becomes convenient for tile j, because availability time of instruction v 
will be changed on the other tiles. Our algorithm proceeds as follows: in each step and for 
each tile, we check all instructions that are convenient for this tile and select the instruction 
with the highest priority i.e., the largest mintime value on this specific tile. 

The algorithm LCTM has the same problem as LSSM for the sample graph T_2, but it
works better for T_3 because it considers the bottom preplaced node. In fact, if k < 5 it does
not schedule any instruction on tile 2, as desired. For large k, it uses both tiles, which
yields an optimum solution.

With the above definition of convenient instructions and the strategy of selecting in- 
structions with larger mintime value, the above algorithm takes into account the following 
considerations: 

— It is better to schedule an instruction on a cluster with smaller mintime. 

— If there is more than one candidate instruction for an empty slot, the instruction with
the higher mintime value has the higher priority.

— If it is impossible to schedule an instruction on a tile with smaller mintime value soon, it 
might be better to schedule this instruction on a less convenient tile much sooner. 
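The convenience test and the selection rule of LCTM are small enough to sketch directly. In this illustrative Python sketch the function names `is_convenient` and `lctm_pick` and the table layout (one list of per-tile values per instruction) are assumptions, not part of the algorithm's statement.

```python
def is_convenient(v, j, avtime, mintime):
    """Definition 7: tile j minimises avtime(v, .) + mintime(v, .)."""
    cost = [a + m for a, m in zip(avtime[v], mintime[v])]
    return cost[j] <= min(cost)


def lctm_pick(j, available, avtime, mintime):
    """Instruction to place on tile j in one LCTM step, or None.

    Among the available instructions that are convenient for tile j,
    the one with the largest mintime on tile j is chosen.
    """
    convenient = [v for v in available if is_convenient(v, j, avtime, mintime)]
    if not convenient:
        return None
    return max(convenient, key=lambda v: mintime[v][j])


# Hypothetical values for three available instructions on two tiles.
avtime = {"a": [1, 1], "b": [1, 1], "c": [1, 1]}
mintime = {"a": [4, 9], "b": [6, 3], "c": [5, 5]}
```

Here `a` is convenient only for tile 0, `b` only for tile 1, and `c` for both, so tile 0 picks `c` (larger mintime than `a`) and tile 1 also picks `c` (larger mintime than `b`) if it is still unscheduled.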
Top-down Constraint Propagation The main flaw of the algorithm LCTM is that it
does not take into account the existing tile assignment of instructions. More precisely, we can
describe the problem with the following example: if instruction 3 is the successor of instructions
1 and 2 and instruction 1 has been scheduled on tile a, then instruction 2 should not be
scheduled on a tile far from a, because in that case the earliest time at which instruction 3 can
be scheduled on any tile will be larger. In order to resolve this flaw in the algorithm, we define
the estimated time of scheduling when we schedule instruction v on tile j.

We can find the estimated makespan for assigning instruction v to tile j, denoted by
estimate(v, j), as follows:

Algorithm FE: Find Estimated time

Input: Precedence graph G.
       A network of tiles with communication delays between them.
       Pipeline delays between instructions.
       Node v and tile j.
       The current scheduling S, current mintime and earlytime values.
Output: estimate(v, j).

1 Assign instruction v to the tile j at its available time.

2 for all unscheduled instructions u ∈ GA(v) do

3     compute earlytime(u) according to the new scheduling using Algorithm RTE.

4 for all unscheduled instructions u ∈ GA(v), value(u) = min_{i ∈ {1, ..., m}} (mintime(u, i) + earlytime(u, i))

5 estimate(v, j) = max over unscheduled instructions u of value(u)
end



Remark 1. According to the current state of the schedule, estimate(v, j) is the minimum
makespan of any scheduling in which instruction v is scheduled on tile j.

With the above definition of the estimated time of scheduling, it is easy to see that the tile j
with the smallest value of estimate(v, j) is heuristically the most convenient tile for scheduling
v. Similar to the algorithm LCTM, we want to take into account the fact that if we cannot
schedule an instruction on a very convenient tile very soon, we would schedule it on a less
convenient tile that is available much sooner. Thus, we define the new convenience property
similar to the previous one as follows:

Definition 8. Instruction v is convenient for tile j if, according to the current partial
schedule, avtime(v, j) + estimate(v, j) ≤ avtime(v, i) + estimate(v, i) for all tiles i.
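Steps 4 and 5 of Algorithm FE and the argmin form of Definition 8 can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the `earlytime` table is assumed to have already been recomputed (steps 1 to 3) for the tentative placement of v on tile j.

```python
def estimate(unscheduled, mintime, earlytime):
    """Steps 4-5 of FE: max over u of min over tiles of mintime + earlytime.

    unscheduled -- the unscheduled instructions considered by FE
    mintime     -- {u: [mintime(u, i) for each tile i]}
    earlytime   -- {u: [earlytime(u, i) for each tile i]}, already updated
    """
    def value(u):
        return min(m + e for m, e in zip(mintime[u], earlytime[u]))

    return max((value(u) for u in unscheduled), default=0)


def most_convenient_tile(avtime_v, estimate_v):
    """Definition 8 as an argmin: the tile minimising avtime + estimate."""
    costs = [a + e for a, e in zip(avtime_v, estimate_v)]
    return costs.index(min(costs))
```

For example, with hypothetical values mintime = {u: [4, 9], w: [2, 2]} and earlytime = {u: [5, 3], w: [7, 6]}, the per-instruction values are 9 and 8, so the estimate is 9.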

The new algorithm LCTE is the same as the algorithm LCTM except that we replace
the old convenience definition with this new one. Note that estimate(v, j) should be
recomputed after scheduling each instruction. Thus the running time of LCTE is more than
that of LCTM. However, the algorithm only needs to dynamically update the estimate(v, j)
attribute of available instructions. Suppose the running times of computing mintime(v, j) and
earlytime(v, j) are O(M_v) and O(E_v) respectively, which are polynomials in |GB(v)| and
|GA(v)|. We find mintime(v, j) for each processor once at the beginning. In order to
compute estimate(v, j), we need to update earlytime(u, i) for all vertices in the subgraph above
the vertex v, namely GA(v). Thus, computing estimate(v, j) takes Σ_{u ∈ GA(v), 1 ≤ i ≤ m} O(E_u)
time. This is clearly a polynomial in n if E_v is a polynomial. As mentioned before, in our
method M_v and E_v are polynomials if in-degrees, out-degrees and the number of processors
are constants. With the above discussion, we conclude that the last list scheduling algorithm,
LCTE, with the definition of convenient processors and the estimate function, is not practically
implementable when the number of tiles, in-degrees, or out-degrees are large. In the following,
we will introduce a clustering idea by which we can solve all the above problems and still get
good performance.

7 Final Algorithm 

Here, we first present a layer partitioning idea by which we can use our heuristics on small
graphs and then construct the whole solution by combining the solutions for all layers. We
will prove that if the algorithm for small graphs has a good performance, this idea leads to an
algorithm with a good performance for general graphs. Then we present our final algorithm,
which is an improvement of the layer partitioning algorithm.

7.1 Layer Partitioning 

In this subsection, we introduce a partitioning idea based on the earlytime bound. This
algorithm is a warmup for our final algorithm. The idea is to schedule the graph G by iteratively
scheduling the subgraphs G_p. That is, our algorithm first partitions the graph into layers
of earlytime smaller than p, schedules each layer separately, and puts all these schedules
together. The main point is to guarantee that the schedules of the layers are independent of each
other. In the first algorithm, this is done by inserting a communication phase of C + 1, where C
is the maximum communication delay between two instructions and tiles (C + 1 is sufficiently
large and bounded by a constant).
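The layer-peeling idea can be sketched as follows. The sketch is illustrative and makes a simplifying assumption that should be stated loudly: instead of the earlytime bound computed by GSE or RTE, it uses a stand-in `earliest_start` that ignores tile limits (a longest-path value starting at 0). The function names and the chain example are hypothetical.

```python
def earliest_start(nodes, preds, length, delay):
    """Stand-in for earlytime: earliest start of each node ignoring tile limits.

    Predecessors outside `nodes` (already peeled off) are ignored, so every
    source of the remaining graph starts at 0.
    """
    est = {}

    def visit(v):
        if v not in est:
            est[v] = max(
                (visit(u) + length[u] + delay[(u, v)]
                 for u in preds.get(v, []) if u in nodes),
                default=0,
            )
        return est[v]

    for v in nodes:
        visit(v)
    return est


def layer_partition(nodes, preds, length, delay, bound):
    """Peel off the vertices whose stand-in earlytime is at most `bound`,
    then repeat on the remaining graph (bound must be non-negative)."""
    remaining = set(nodes)
    layers = []
    while remaining:
        est = earliest_start(remaining, preds, length, delay)
        layer = {v for v in remaining if est[v] <= bound}
        layers.append(layer)
        remaining -= layer
    return layers


# Hypothetical chain a -> b -> c -> d with unit lengths and pipeline delay 2.
chain_preds = {"b": ["a"], "c": ["b"], "d": ["c"]}
chain_len = {v: 1 for v in "abcd"}
chain_delay = {("a", "b"): 2, ("b", "c"): 2, ("c", "d"): 2}
layers = layer_partition("abcd", chain_preds, chain_len, chain_delay, bound=4)
```

On this chain the stand-in values are 0, 3, 6 and 9, so a bound of 4 yields the two layers {a, b} and {c, d}. Each layer would then be handed to the scheduling algorithm for small graphs.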

We assume in this section that we are given a scheduling algorithm A for the small graphs, 
i.e., graphs of earlytime smaller than p. Then we discuss that if p is bounded by a constant, 
these small graphs have the desired properties to use the heuristic LCTE. 

Algorithm PSSG: layer Partitioning and Scheduling Small Graphs 

Input: A precedence graph G, and a scheduling algorithm A for graphs of earlytime smaller than p + 1.
A network of tiles with communication delays between them. 
Pipeline delays between instructions. 



Output: A schedule of graph G. 
Remark 2. According to Theorem 5 and algorithm PSSG, it is straightforward that
Σ_{i=1}^{k} earlytime(G_i) ≤ earlytime(G).

First, we observe that if we select p > C then the length of the optimum schedule for
each G_i is at least p > C. Now, we want to show that if we have an algorithm A with
good performance for small graphs, algorithm PSSG has a good performance as well. We
formulate and prove this in the following theorem:
Theorem 6. Let L(G) = Σ_{v ∈ V(G)} p(v). If algorithm A can schedule each small graph G_i
in at most α⌈L(G_i)/m⌉ + βp + γ earlytime(G_i), then PSSG can schedule any graph G in
α⌈L(G)/m⌉ + βp + (β + γ + 1) earlytime(G), i.e., it is an (α + 2β + γ + 1)-approximation.

Proof. Suppose the graph G is partitioned into layers G_1, G_2, G_3, ..., G_k. The makespan
of the output of PSSG is less than or equal to

M = α Σ_{i=1}^{k} ⌈L(G_i)/m⌉ + (k − 1)C + kβp + γ Σ_{i=1}^{k} earlytime(G_i).

First, Σ_{i=1}^{k} L(G_i) = L(G), and according to Remark 2, Σ_{i=1}^{k} earlytime(G_i) ≤
earlytime(G). Finally, earlytime(G)/p + 1 ≥ k, thus kp ≤ earlytime(G) + p, and p > C. From
all these inequalities, it turns out that the output of PSSG is less than

α⌈L(G)/m⌉ + βp + (β + γ + 1) earlytime(G)

as desired. This is an (α + 2β + γ + 1)-approximation, because ⌈L(G)/m⌉, earlytime(G), and p
are all lower bounds on the optimum solution.
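The chain of inequalities behind the bound can be written out term by term. This is a sketch that follows the proof's own accounting: it treats the load term as additive over the layers (rounding is ignored, as in the proof) and uses only k ≤ earlytime(G)/p + 1, C < p, and Remark 2.

```latex
\begin{align*}
M &= \alpha \sum_{i=1}^{k} \left\lceil \frac{L(G_i)}{m} \right\rceil
     + (k-1)\,C + k\beta p
     + \gamma \sum_{i=1}^{k} \mathrm{earlytime}(G_i) \\[4pt]
(k-1)\,C &\le (k-1)\,p \;\le\; \mathrm{earlytime}(G)
   && \text{since } C < p \text{ and } kp \le \mathrm{earlytime}(G) + p \\
k\beta p &\le \beta\,\mathrm{earlytime}(G) + \beta p
   && \text{since } kp \le \mathrm{earlytime}(G) + p \\
\gamma \sum_{i=1}^{k} \mathrm{earlytime}(G_i) &\le \gamma\,\mathrm{earlytime}(G)
   && \text{by Remark 2} \\[4pt]
M &\le \alpha \left\lceil \frac{L(G)}{m} \right\rceil + \beta p
     + (\beta + \gamma + 1)\,\mathrm{earlytime}(G)
\end{align*}
```

Bounding each of the three quantities on the right by the optimum makespan gives the factor α + β + (β + γ + 1) = α + 2β + γ + 1.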

We conclude this section by defining small graphs formally and proving their properties. 

Definition 9. The precedence graph H is small if and only if for all vertices v ∈ V(H),
earlytime(v) < p and mintime(v) < p.

Note that the above definition of small graphs is stronger than what we had in the PSSG
algorithm. Here, not only is the earlytime of every vertex less than p, but its mintime value
is also less than p. As we will see in the next chapter, it is not hard to partition the graph into
small layers.

Remark 3. For any small graph H and any vertex v ∈ V(H), the in-degree and out-degree of v are
at most p. The reason is that in-degree(v) ≤ earlytime(v) and out-degree(v) ≤ mintime(v).

Remark 3 shows that by choosing an appropriate p (bounded by a constant), the running
time of our heuristics in the previous section is polynomial; thus we can use Algorithm
LCTE to schedule a small graph.

7.2 Partitioning, Clustering and List Scheduling of Clustered Graph 

In this subsection, we present our final algorithm by combining all the ideas. From Section ??,
we had several heuristics: LSSM, which is suitable for small communication delays, and LCTE,
which is applicable to graphs with small in-degrees and out-degrees, e.g., small graphs. From
Section 7.1, we know how to partition the graph into layers of small graphs. In this section, we
use this partitioning idea. However, instead of inserting C empty slots between layers, we use
the output of small graph scheduling as a clustering, construct a new precedence graph from
this clustering, and then use an algorithm like LSSM, which is suitable for small communication
delays, on the new precedence graph. Although the provable worst-case performance ratio
is the same as that of PSSG, the difference is clear from the practical results.

In order to formalize our algorithm, we need to define clustering and clustered graph 
formally. 

Definition 10. Given a precedence graph G of instructions, a clustering is defined as a family
of partial schedules of instructions, namely S = {S_1, S_2, ..., S_m}, where S is a partitioning
of all vertices of G and all instructions in S_i should be scheduled in a prespecified order on
the same tile, i.e., there is a specific order for the instructions of S_i and these instructions
should be scheduled in this order.
Given a clustering of G, we want to construct a new precedence graph by creating a vertex
for each cluster and an edge between two clusters if there is an edge from a vertex in
the source cluster to a vertex in the destination cluster. Then the delay between two clusters
can be found using the delays between their vertices as follows:

Definition 11. Given two clusters S = (s_1, s_2, ..., s_p) and T = (t_1, t_2, ..., t_q) of G, if the
completion time of instruction s_i in S is C(s_i) and the completion time of t_j in T is C(t_j), then
the delay between S and T, namely delay(S, T), is defined as
max_{(s_i, t_j) ∈ E(G)} {delay(s_i, t_j) − C(s_p) − C(t_j) + C(s_i)}.

Intuitively, the above delay is the minimum needed delay between clusters S and T.

Definition 12. Given a precedence graph G, communication and pipeline delays delay(i, j)
and com(i, j), and a clustering S_1, S_2, ..., S_t, the clustered graph, namely CG, is defined as
follows: V(CG) = {S_1, S_2, ..., S_t} and (S, T) ∈ E(CG) iff there exist vertices i ∈ S and
j ∈ T such that (i, j) ∈ E(G), and delay(S, T) is as defined in Definition 11.

Similar to the definition of G_p, we define G^p as follows: V(G^p) = {v ∈ V(G) | mintime(v) ≤ p}.

This is our final algorithm: 

Algorithm PCLC: layer Partitioning, Clustering and List scheduling of Clustered graph

Input: A precedence graph G, and a scheduling algorithm A (LCTE) for graphs of mintime smaller than p.
       A network of tiles with communication delays between them.
       Pipeline delays between instructions.
Output: A schedule of graph G.
begin

1 let H = G

2 let i = 0

3 while H ≠ ∅

4     let L = H_p

5     while L ≠ ∅

6         let i = i + 1

7         let G_i = L^p, and let L = L \ L^p

8     let H = H \ H_p

9 let k = i

10 for i = 1 to k

11     Use algorithm A (LCTE) to schedule G_i

12     let S_ij be the set of vertices in G_i that are scheduled on tile j in the output of A (LCTE).

13 let CG be the clustered graph corresponding to the clustering of the S_ij's.

14 Schedule CG using LSSM

15 // any scheduling suitable for small communication delays can be used here, like ILP-based ones.
end



Note that since p > C, there exists a cluster in each layer with execution time more than C.
Thus, using LSSM is reasonable. Although we cannot prove a better performance guarantee
for PCLC than for PSSG, the output of PCLC is better in practice, and using it is more
reasonable.

We conclude this section by emphasizing that PCLC combines the following ingredients: the
heuristic algorithm LCTE for small graphs; a suitable algorithm (like the ILP-based algorithm
or LSSM) for small communication delays; a layer partitioning method that schedules each
layer separately; and, finally, scheduling of the clustered graph.
8 Experimental Results

This section presents results of the algorithms described in Section 6. The algorithms are 
implemented in Rawcc, the instruction level parallelizing compiler for the Raw machine, 
which is a spatial architecture. 
Fig. 3. The Raw machine.



Experimental setup Experiments are performed on Beetle, a validated, cycle-accurate
simulator of the Raw machine. Figure 3 shows a picture of the Raw machine [?]. The Raw
machine comprises tiles organized in a two-dimensional mesh. The actual Raw prototype has
16 tiles in a 4x4 mesh. Each tile has its own instruction memory, processor pipeline, ALUs,
data memory, and 28 registers. Its instruction set is based on the MIPS R4000.

The tiles communicate with each other via point-to-point mesh networks. In addition to
a traditional wormhole-routed dynamic network, Raw has a programmable, compiler-controlled
static network that is used to route scalar values between the register files/ALUs on different
tiles. Network ports are register mapped (the count of 28 registers does not include these
network ports). Latency on the static network is three cycles between two neighboring tiles;
each additional hop takes an extra cycle of latency.

Our scheduling algorithms are implemented in Rawcc, the instruction level parallelizing 
compiler for the Raw machine. Rawcc takes a sequential C or Fortran program and parallelizes 
it across the Raw tiles. Each program is divided into one or more scheduling traces. For each 
trace, Rawcc constructs the data precedence graph for the instructions and performs space- 
time scheduling on the graph. After space-time scheduling is performed on all the scheduling 
regions, the code on each tile is run separately through a traditional register allocation [?]. 

Rawcc employs congruence transformation and analysis to increase and analyze the pre- 
dictability of memory references [3, 12]. This analysis creates memory reference instructions 
that must be placed on specific tiles. For dense matrix loops, the congruence pass usually 
unrolls the loops by the number of clusters or tiles. This unrolling also increases the size of 
the scheduling regions, so that no additional unrolling is necessary to expose parallelism. 

Sources of benchmarks include the Raw benchmark suite (Jacobi, Life) [2], Spec95 (Swim), 
and Nasa7 of Spec92 (Cholesky, Vpenta, and Mxm). Sha is an implementation of Secure Hash 
Algorithm. Fpppp-kernel is the inner loop of fpppp in Spec95 that accounts for 50% of the 
execution time. Some problem sizes have been changed to cut down on simulation time, but
they do not affect the results qualitatively.

Comparison We compare two algorithms from Section 6 with two existing space-time
scheduling algorithms. The first existing algorithm is BUG, one of the earliest space-time
scheduling algorithms for instruction level parallelism. The second existing algorithm is the
algorithm implemented in Rawcc, described in [13]. This algorithm uses a three-phase algorithm
to map instructions to tiles, then a separate list scheduling algorithm to assign instructions to
time slots. Our new algorithms are implemented in place of the space scheduling algorithms of
Rawcc; Rawcc does its own time scheduling via list scheduling.
                 6490961    1.210  2.109  4.150  6.343    1.632  2.464  3.938  6.161
Mxm              1570607    1.145  2.086  4.056  7.186    1.612  2.491  3.683  5.066
Swim            83132395    1.224  1.831  3.251  5.658    1.360  2.245  2.837  4.868
Sha               955237    1.180  1.580  1.704  2.132    1.470  1.540  1.726  1.967
Fpppp-kernel      150395    1.479  2.675  5.072  6.073    1.751  2.563  3.613  4.113



Table 1. Speedup of Rawcc's original algorithm and BUG. "Seq. Time" is the sequential run-time 
of each benchmark on a single tile. All speedups are measured relative to execution time on one tile. 





                       Mintime-based                  Estimate-based
Benchmark        N=2    N=4    N=8    N=16      N=2    N=4    N=8    N=16
Jacobi          1.323  2.318  4.725  7.794     1.128  2.089  3.906  7.131
Life            1.314  2.276  4.863  6.749     1.260  2.331  4.406  8.061
Cholesky        1.408  2.012  3.624  4.953     1.120  1.869  3.123  4.071
Vpenta          1.520  2.306  4.152  6.860     1.480  2.471  4.270  6.573
Mxm             1.601  2.197  4.221  8.500     1.527  2.193  4.100  7.478
Swim            1.536  2.160  4.446  7.930     1.425  2.108  4.101  7.260
Sha             1.051  0.945  1.032  1.034     1.333  1.346  1.552  1.575
Fpppp-kernel    2.083  2.005  2.001  2.023     1.577  2.698  3.357  4.779



Table 2. Speedup of mintime-based and estimate-based algorithms. 



Table 1 shows the performance of Rawcc with the existing algorithms, and Table 2 presents
the performance of our mintime-based and estimate-based algorithms. Figure 4 displays the
relative performance of each algorithm on 16 tiles. We note in passing that of the two
existing algorithms, Rawcc's original algorithm performs better than BUG. We find that BUG
performs relatively poorly for two reasons: first, it does not take constraints from preplaced
instructions into account very well; second, it tends to traverse the graph depth first while
making greedy decisions. The consequence of the depth-first traversal is that the traversal
tends to expose fine-grained parallelism before coarser-grained parallelism. As a result,
unnecessary communication is often introduced. BUG was designed and evaluated on a spatial
architecture whose functional units are connected via a single-cycle crossbar. It is less suitable
for architectures with more costly communication, such as Raw.

Fig. 4. Performance improvement of the algorithms on 16 tiles, relative to Original Rawcc. 



For our new algorithms, we focus our attention on their performance relative to Rawcc's
original algorithm. Results show that these new algorithms perform competitively overall, with
an average speedup of 5.7 to 5.8 on 16 tiles. Closer examination reveals that the results can
be roughly divided into two classes. Dense matrix applications include Jacobi, Life, Cholesky,
Vpenta, Mxm, and Swim. These applications have unrolled loops that are highly regular.
In addition, congruence analysis is able to identify many preplaced memory instructions. For
these applications, preplaced instructions give very good guidance about how instructions should
be partitioned. The mintime-based and estimate-based algorithms are designed to take advantage
of this information, and they achieve better results than the original algorithm. For these
benchmarks, the average improvements of our algorithms over Rawcc's original algorithm are
25% and 14%, respectively. Sha and Fpppp-kernel, however, are less regular and have no
preplaced instructions. For these benchmarks, our algorithms perform worse. However, note that
the estimate-based algorithm, which uses relatively accurate completion time estimates that
propagate scheduling information both upwards and downwards, is able to perform more than 75%
better than the mintime-based algorithm on the irregular benchmarks, at a cost of only 10%
worse performance on the regular benchmarks.
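The 16-tile averages quoted above can be recomputed directly from the N=16 columns of Table 2. The short Python sketch below does only that arithmetic; the variable names are chosen here for illustration, and the grouping of Sha and Fpppp-kernel as the irregular benchmarks follows the text.

```python
from statistics import mean

# Benchmark: (mintime-based, estimate-based) speedup on 16 tiles, from Table 2.
speedup_16 = {
    "Jacobi": (7.794, 7.131),
    "Life": (6.749, 8.061),
    "Cholesky": (4.953, 4.071),
    "Vpenta": (6.860, 6.573),
    "Mxm": (8.500, 7.478),
    "Swim": (7.930, 7.260),
    "Sha": (1.051 and 1.034, 1.575),
    "Fpppp-kernel": (2.023, 4.779),
}

mean_mintime = mean(m for m, _ in speedup_16.values())
mean_estimate = mean(e for _, e in speedup_16.values())

# Gain of the estimate-based over the mintime-based algorithm,
# averaged over the two irregular benchmarks.
irregular = ["Sha", "Fpppp-kernel"]
irregular_gain = mean(speedup_16[b][1] / speedup_16[b][0] for b in irregular)
```

The two means come out at roughly 5.73 and 5.87, and the average estimate-to-mintime ratio on the irregular benchmarks is well above 1.75, which is in line with the figures stated in the text.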

9 Conclusion and Future Work 

In this paper, we discussed different methods for solving the general instruction scheduling 
problem in a network of tiles. 

We discussed two different integer programming formulations. The first is suitable for 
formalizing our general problem; the second is smaller and yields an approximation 
algorithm for a special case. To decrease the number of variables and constraints 
in the first formulation, one potential research direction is to use ideas from [24]. Using 
LP-relaxation and LP-rounding to design an approximation algorithm based on this LP is 
an interesting theoretical problem that requires a better understanding of the LP-relaxation. The 
second formulation only works for the case of a fully connected network of tiles and small 
communication delays. 

We described several heuristic bounds and algorithms based on them. A theoretical analysis 
of these algorithms would be an interesting problem. No constant-factor approximation 
algorithm is known for the case of large communication delays. Finding a constant-factor 
approximation algorithm in this case, or proving a lower bound on the approximation factor, is also an 
interesting theoretical problem. In the case of small communication delays, a lower bound of | 
on the approximation factor is known, but the best known approximation factor is |. 

We implemented these algorithms separately. Using these bounds along with other previous 
algorithms, such as local optimization, might give better practical results. Another 
application of the aforementioned bounds is in a branch-and-bound backtracking method. There, 
they can serve as a branch hint for searching the state space in a more appropriate order, 
and as an estimate or a lower bound for the best solution that can be reached from the 
current assignment. 
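
The branch-and-bound idea above can be illustrated with a minimal sketch. It is not the 
implementation evaluated in this paper, and it makes several simplifying assumptions that 
are ours: every instruction takes one cycle, the m clusters are fully connected with a uniform 
communication delay c, and instructions are assigned in a fixed topological order. The names 
branch_and_bound, lower_bound, and search are hypothetical. The lower bound used here is a 
simple stand-in (critical path without communication, and total work divided by m) rather 
than the bounds described in the earlier sections. 

```python
from math import ceil

def branch_and_bound(preds, m, c):
    """Search cluster assignments for a DAG of unit-latency instructions.

    preds[i] lists the predecessors of instruction i; instructions are
    assumed to be numbered in topological order. m is the number of
    clusters and c the (assumed uniform) inter-cluster delay.
    Returns (makespan, assignment).
    """
    n = len(preds)
    best = {"makespan": float("inf"), "assign": None}
    assign = [None] * n   # cluster chosen for each instruction
    finish = [0] * n      # finish time of each scheduled instruction

    def lower_bound(i, current_max):
        # Optimistic finish times for the unscheduled instructions i..n-1:
        # ignore communication delays and cluster contention.
        opt = finish[:i]
        for j in range(i, n):
            opt.append(1 + max((opt[p] for p in preds[j]), default=0))
        return max([current_max, ceil(n / m)] + opt[i:])

    def search(i, cluster_free, current_max):
        if i == n:
            if current_max < best["makespan"]:
                best["makespan"] = current_max
                best["assign"] = assign[:]
            return
        # Prune: no completion of this partial assignment can beat the best.
        if lower_bound(i, current_max) >= best["makespan"]:
            return
        options = []
        for k in range(m):
            ready = max((finish[p] + (0 if assign[p] == k else c)
                         for p in preds[i]), default=0)
            options.append((max(ready, cluster_free[k]) + 1, k))
        # Branch hint: try the cluster with the earliest finish time first.
        for end, k in sorted(options):
            assign[i], finish[i] = k, end
            saved = cluster_free[k]
            cluster_free[k] = end
            search(i + 1, cluster_free, max(current_max, end))
            cluster_free[k] = saved

    search(0, [0] * m, 0)
    return best["makespan"], best["assign"]
```

For a three-instruction dependence chain on two clusters with c = 3, the sketch keeps all 
instructions on one cluster. The first branch it explores already matches the lower bound, 
so every remaining branch is pruned. Because instructions are placed in a fixed topological 
order, the search covers only schedules consistent with that order. 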

Spatial architectures are becoming increasingly important because they are a natural way 
to address the lack of scalability of wire delays with technology. Therefore, we believe that it 
is important to have a better understanding of the problem both theoretically and practically. 
This paper contributes toward this goal. 